TBS: Replace badger with pebble #15235
This pull request does not have a backport label. Could you fix it @carsonip? 🙏
Amazing work!
This PR replaces badger with pebble as the database for tail-based sampling. Significant performance gains.

The database of choice is Pebble, which does not have TTL handling built-in, and we implement our own TTL handling on top of the database:
- TTL is divided up into N parts, where N is partitionsPerTTL.
- A database holds N + 1 + 1 partitions.
- Every TTL/N we will discard the oldest partition, so we keep a rolling window of N+1 partitions.
- Writes will go to the most recent partition, and we'll read across N+1 partitions

(cherry picked from commit 0ca58b8)

# Conflicts:
# go.mod
# go.sum
# internal/beater/monitoringtest/opentelemetry.go
# x-pack/apm-server/main.go
# x-pack/apm-server/main_test.go
# x-pack/apm-server/sampling/processor.go
# x-pack/apm-server/sampling/processor_bench_test.go
# x-pack/apm-server/sampling/processor_test.go
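The partitioned TTL scheme described in the commit message above can be sketched as a ring of partitions. This is a minimal in-memory illustration, not the apm-server implementation: the type `partitionedStore` and its methods are hypothetical names, maps stand in for Pebble, and the reading that the extra slot of the N + 1 + 1 partitions is the one being cleared is an assumption.

```go
package main

import "fmt"

// partitionedStore is a hypothetical in-memory stand-in for a key-value
// database without built-in TTL. Each partition is a separate map here;
// a real store might use key prefixes instead (assumption).
type partitionedStore struct {
	partitionsPerTTL int                 // N: number of parts the TTL is divided into
	partitions       []map[string]string // N+2 slots, used as a ring
	current          int                 // index of the partition receiving writes
}

func newPartitionedStore(n int) *partitionedStore {
	s := &partitionedStore{
		partitionsPerTTL: n,
		partitions:       make([]map[string]string, n+2),
	}
	for i := range s.partitions {
		s.partitions[i] = map[string]string{}
	}
	return s
}

// Write stores the entry in the most recent partition only.
func (s *partitionedStore) Write(k, v string) {
	s.partitions[s.current][k] = v
}

// Read checks the N+1 partitions in the rolling window, newest first.
func (s *partitionedStore) Read(k string) (string, bool) {
	total := len(s.partitions)
	for i := 0; i <= s.partitionsPerTTL; i++ {
		idx := (s.current - i + total) % total
		if v, ok := s.partitions[idx][k]; ok {
			return v, true
		}
	}
	return "", false
}

// Rotate is meant to be called every TTL/N. It advances the write
// partition and clears the slot that has fallen out of the N+1 window.
func (s *partitionedStore) Rotate() {
	total := len(s.partitions)
	s.current = (s.current + 1) % total
	expired := (s.current + 1) % total // the one slot outside the window
	s.partitions[expired] = map[string]string{}
}

func main() {
	s := newPartitionedStore(1) // partitionsPerTTL = 1
	s.Write("trace-1", "sampled")

	s.Rotate() // one TTL/N elapsed: the entry is still inside the window
	_, ok := s.Read("trace-1")
	fmt.Println("readable after 1 rotation:", ok)

	s.Rotate() // the entry's partition has now been discarded
	_, ok = s.Read("trace-1")
	fmt.Println("readable after 2 rotations:", ok)
}
```

In this sketch an entry is never deleted individually; it disappears when its whole partition is dropped, so expiry costs one bulk clear per TTL/N rather than per-key bookkeeping.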
…sor config (#15488)

Fix a regression from #15235 where storage_limit does not follow processor lifecycle. Remove storage limit from processor config. Add storage to processor config validation.

(cherry picked from commit dcb08ac)

# Conflicts:
# x-pack/apm-server/main.go
# x-pack/apm-server/sampling/config.go
# x-pack/apm-server/sampling/config_test.go
# x-pack/apm-server/sampling/eventstorage/rw.go
# x-pack/apm-server/sampling/eventstorage/rw_test.go
# x-pack/apm-server/sampling/eventstorage/storage_bench_test.go
# x-pack/apm-server/sampling/eventstorage/storage_manager.go
# x-pack/apm-server/sampling/eventstorage/storage_manager_bench_test.go
# x-pack/apm-server/sampling/eventstorage/storage_manager_test.go
# x-pack/apm-server/sampling/processor.go
# x-pack/apm-server/sampling/processor_test.go
…sor config (#15488) (#15491)

Fix a regression from #15235 where storage_limit does not follow processor lifecycle. Remove storage limit from processor config. Add storage to processor config validation.

(cherry picked from commit dcb08ac)

Co-authored-by: Carson Ip <[email protected]>
Fix a missing colon in logs (typo from #15235), and remove "storage" in "configured storage limit reached" message to make way for #15467 to avoid confusion

(cherry picked from commit 28068bd)

Co-authored-by: Carson Ip <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Motivation/summary
This PR replaces badger with pebble as the database for tail-based sampling.
Benchmarks
TLDR: +3000% in indexed events/s and +76% in intakev2 event rate, with 75% lower memory usage and 44% lower disk usage
See comment for details.
Major design changes
… 2*TTL. In fact, there's a knob partitionsPerTTL to adjust the available prefixes to trade between storage overhead and read amplification. e.g. partitionsPerTTL=1 keeps 2*TTL entries with 2 partition reads per key read, while partitionsPerTTL=2 keeps 1.5*TTL entries with 3 partition reads per key read.
Other implied changes
… sampling.tail.storage.lsm_size, while sampling.tail.storage.vlog_size is always 0. The change is not decided yet.
TODO:
- … sampling.tail.storage_limit and storage limit handling #14933
- Check for memory leak by running it for a long time, as pebble does manual memory management
- event loss on upgrade: see Migration path from badger to pebble #15423
Useful but not necessary, out of scope of this PR:
- … testing/infra/terraform/modules/standalone_apm_server for reproducible benchmark with moxy and without using ESS: terraform: support TBS in standalone_apm_server #15337
Checklist
For functional changes, consider:
How to test these changes
Enable TBS, try various sampling policies, send events, keep it running for over 2 * TTL, and ensure that disk usage is bounded and memory usage is as expected.
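When choosing how long to run such a test, it may help to work out how long entries can be retained for a given partitionsPerTTL. The trade-off described under Major design changes follows from a rolling window of N+1 readable partitions, each covering TTL/N. A minimal sketch of that arithmetic, where the function name `retentionFactor` is hypothetical and not part of apm-server:

```go
package main

import "fmt"

// retentionFactor returns, for partitionsPerTTL = n, the multiple of TTL
// for which entries may be kept and the number of partitions consulted per
// key read. It assumes n+1 readable partitions, each spanning TTL/n.
func retentionFactor(n int) (ttlMultiple float64, partitionReads int) {
	partitionReads = n + 1
	ttlMultiple = float64(partitionReads) / float64(n)
	return ttlMultiple, partitionReads
}

func main() {
	for _, n := range []int{1, 2, 4} {
		m, reads := retentionFactor(n)
		fmt.Printf("partitionsPerTTL=%d: keeps %.2f*TTL of entries, %d partition reads per key read\n", n, m, reads)
	}
}
```

Larger values shrink the storage overhead towards 1*TTL at the cost of more partition reads per key, which matches the two data points quoted in the description (2*TTL with 2 reads, 1.5*TTL with 3 reads).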
Related issues
Fixes #15246